69 research outputs found

    Lightweight Asynchronous Snapshots for Distributed Dataflows

    Distributed stateful stream processing enables the deployment and execution of large-scale continuous computations in the cloud, targeting both low latency and high throughput. One of the most fundamental challenges of this paradigm is providing processing guarantees under potential failures. Existing approaches rely on periodic global state snapshots that can be used for failure recovery. These approaches suffer from two main drawbacks. First, they often stall the overall computation, which impacts ingestion. Second, they eagerly persist all records in transit along with the operator states, which results in larger snapshots than required. In this work we propose Asynchronous Barrier Snapshotting (ABS), a lightweight algorithm suited for modern dataflow execution engines that minimises space requirements. ABS persists only operator states on acyclic execution topologies, while keeping a minimal record log on cyclic dataflows. We implemented ABS on Apache Flink, a distributed analytics engine that supports stateful stream processing. Our evaluation shows that our algorithm does not have a heavy impact on the execution, maintaining linear scalability and performing well with frequent snapshots. Comment: 8 pages, 7 figures
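
The barrier-alignment phase at the core of ABS can be sketched in a few lines of plain Python. This is an illustrative reconstruction, not Flink's actual API: an operator blocks each input channel once that channel's barrier arrives, buffers the channel's post-barrier records, and snapshots only its own state when barriers from all inputs have aligned.

```python
# Illustrative sketch of barrier alignment in Asynchronous Barrier
# Snapshotting (ABS); class and field names are assumptions, not Flink's API.
from collections import defaultdict

BARRIER = "BARRIER"

class Operator:
    """A summing operator with multiple input channels that aligns barriers."""

    def __init__(self, num_inputs):
        self.num_inputs = num_inputs
        self.count = 0                      # operator state
        self.blocked = set()                # channels already at the barrier
        self.buffered = defaultdict(list)   # records held back during alignment
        self.snapshots = []                 # persisted operator states

    def receive(self, channel, item):
        if item == BARRIER:
            self.blocked.add(channel)
            if len(self.blocked) == self.num_inputs:
                # All inputs aligned: persist only the operator state
                # (no in-flight records needed on acyclic topologies).
                self.snapshots.append(self.count)
                self.blocked.clear()
                for ch in list(self.buffered):
                    for rec in self.buffered.pop(ch):
                        self._process(rec)
        elif channel in self.blocked:
            self.buffered[channel].append(item)  # hold back post-barrier records
        else:
            self._process(item)

    def _process(self, record):
        self.count += record

op = Operator(num_inputs=2)
# Channel 0 sees its barrier early; its later record (5) must wait.
for channel, item in [(0, 1), (1, 2), (0, BARRIER), (0, 5), (1, 3), (1, BARRIER)]:
    op.receive(channel, item)

print(op.snapshots)  # state at the aligned barrier: 1 + 2 + 3 = [6]
print(op.count)      # after replaying the buffered record: 6 + 5 = 11
```

The point of the buffering is consistency: record 5 was emitted after the barrier on channel 0, so it must not contaminate the snapshot, yet it is replayed immediately afterwards so no data is lost.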

    Spinning Fast Iterative Data Flows

    Parallel dataflow systems are a central part of most analytic pipelines for big data. The iterative nature of many analysis and machine learning algorithms, however, is still a challenge for current systems. While certain types of bulk iterative algorithms are supported by novel dataflow frameworks, these systems cannot exploit the computational dependencies present in many algorithms, such as graph algorithms. As a result, these algorithms are executed inefficiently, which has led to specialized systems based on other paradigms, such as message passing or shared memory. We propose a method to integrate incremental iterations, a form of workset iterations, with parallel dataflows. After showing how to integrate bulk iterations into a dataflow system and its optimizer, we present an extension to the programming model for incremental iterations. The extension compensates for the lack of mutable state in dataflows and allows for exploiting the sparse computational dependencies inherent in many iterative algorithms. The evaluation of a prototypical implementation shows that those aspects, when exploited, lead to up to two orders of magnitude speedup in algorithm runtime. In our experiments, the improved dataflow system is highly competitive with specialized systems while maintaining a transparent and unified dataflow abstraction. Comment: VLDB201
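
The workset (incremental) iteration pattern described above can be illustrated on the classic connected-components example. This is a hedged, self-contained sketch of the pattern in plain Python, not the paper's actual programming model: each step touches only the vertices whose solution-set entry changed, so the work per step shrinks as the result converges.

```python
# Sketch of a workset (incremental) iteration: the workset holds only the
# vertices that changed in the previous step; data structures are
# illustrative assumptions, not the paper's API.

def connected_components(vertices, edges):
    # Solution set: vertex -> current component id (initially its own id).
    component = {v: v for v in vertices}
    neighbors = {v: [] for v in vertices}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)

    workset = set(vertices)  # initially every vertex is "changed"
    while workset:
        next_workset = set()
        for v in workset:
            for n in neighbors[v]:
                # Propagate the smaller component id; a neighbor re-enters
                # the workset only if its entry actually changes.
                if component[v] < component[n]:
                    component[n] = component[v]
                    next_workset.add(n)
        workset = next_workset  # sparse: shrinks as the result converges
    return component

comp = connected_components(
    vertices=[1, 2, 3, 4, 5],
    edges=[(1, 2), (2, 3), (4, 5)],
)
print(comp)  # {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}
```

A bulk iteration would recompute every vertex in every superstep; the workset variant performs the same fixpoint computation while only processing the changing frontier, which is exactly the sparse-dependency structure the abstract refers to.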

    Implementing Parallel Differential Evolution on Spark

    [Abstract] Metaheuristics are gaining increased attention as an efficient way of solving hard global optimization problems. Differential Evolution (DE) is one of the most popular algorithms in that class. However, its application to realistic problems results in excessive computation times. Therefore, several parallel DE schemes have been proposed, most of them focused on traditional parallel programming interfaces and infrastructures. However, with the emergence of Cloud Computing, new programming models, like Spark, have appeared that are suited to large-scale data processing on clouds. In this paper we investigate the applicability of Spark to develop parallel DE schemes to be executed in a distributed environment. Both the master-slave and the island-based DE schemes usually found in the literature have been implemented using Spark. The speedup and efficiency of all the implementations were evaluated on the Amazon Web Services (AWS) public cloud, concluding that the island-based solution is the best suited to the distributed nature of Spark. It achieves a good speedup versus the serial implementation and shows decent scalability as the number of nodes grows.
    Funding: Ministerio de Economía y Competitividad (DPI2014-55276-C5-2-R); Xunta de Galicia (GRC2013/055); Xunta de Galicia (R2014/04)

    Renal sympathetic denervation restores aortic distensibility in patients with resistant hypertension: data from a multi-center trial

    Renal sympathetic denervation (RDN) is under investigation as a treatment option in patients with resistant hypertension (RH). Determinants of arterial compliance may, however, help to predict the blood pressure (BP) response to therapy. Aortic distensibility (AD) is a well-established parameter of aortic stiffness and can reliably be obtained by CMR. This analysis sought to investigate the effects of RDN on AD and to assess the predictive value of pre-treatment AD for BP changes. We analyzed data of 65 patients with RH included in a multicenter trial. RDN was performed in all participants. A standardized CMR protocol was utilized at baseline and at 6-month follow-up. AD was determined as the change in cross-sectional aortic area per unit change in BP. Office BP decreased significantly from 173/92 ± 24/16 mmHg at baseline to 151/85 ± 24/17 mmHg (p < 0.001) 6 months after RDN. Maximum aortic areas increased from 604.7 ± 157.7 to 621.1 ± 157.3 mm² (p = 0.011). AD improved significantly by 33%, from 1.52 ± 0.82 to 2.02 ± 0.93 × 10⁻³ mmHg⁻¹ (p < 0.001). The increase of AD at follow-up was significantly more pronounced in younger patients (p = 0.005) and in responders to RDN (p = 0.002). Patients with high baseline AD were significantly younger (61.4 ± 10.1 vs. 67.1 ± 8.4 years, p = 0.022). However, there was no significant correlation of baseline AD with response to RDN. AD is improved after RDN across all age groups. Importantly, these improvements appear to be unrelated to the observed BP changes, suggesting that RDN may have direct effects on the central vasculature.

    COVID-19 symptoms at hospital admission vary with age and sex: results from the ISARIC prospective multinational observational study

    Background: The ISARIC prospective multinational observational study is the largest cohort of hospitalized patients with COVID-19. We present relationships of age, sex, and nationality to presenting symptoms. Methods: International, prospective observational study of 60 109 hospitalized symptomatic patients with laboratory-confirmed COVID-19 recruited from 43 countries between 30 January and 3 August 2020. Logistic regression was performed to evaluate relationships of age and sex to published COVID-19 case definitions and the most commonly reported symptoms. Results: ‘Typical’ symptoms of fever (69%), cough (68%) and shortness of breath (66%) were the most commonly reported; 92% of patients experienced at least one of these. Prevalence of typical symptoms was greatest in 30- to 60-year-olds (respectively 80, 79, 69%; at least one 95%). They were reported less frequently in children (≀ 18 years: 69, 48, 23%; 85%), older adults (≄ 70 years: 61, 62, 65%; 90%), and women (66, 66, 64%; 90%; vs. men 71, 70, 67%; 93%; each P < 0.001). The most common atypical presentations under 60 years of age were nausea and vomiting and abdominal pain, and over 60 years, confusion. Regression models showed significant differences in symptoms with sex, age and country. Interpretation: This international collaboration has allowed us to report reliable symptom data from the largest cohort of patients admitted to hospital with COVID-19. Adults over 60 and children admitted to hospital with COVID-19 are less likely to present with typical symptoms. Nausea and vomiting are common atypical presentations under 30 years. Confusion is a frequent atypical presentation of COVID-19 in adults over 60 years. Women are less likely to experience typical symptoms than men.

    Programming Abstractions, Compilation, and Execution Techniques for Massively Parallel Data Analysis

    No full text
    We are witnessing an explosion in the amount of available data. Today, businesses and scientific institutions have the opportunity to analyze empirical data at unprecedented scale. For many companies, the analysis of their accumulated data is nowadays a key strategic aspect. Today’s analysis programs consist not only of traditional relational-style queries; they use increasingly complex data mining and machine learning algorithms to discover hidden patterns or build predictive models. However, with the increasing data volume and the increasingly complex questions that people aim to answer, there is a need for new systems that scale to the data size and to the complexity of the queries. Relational Database Management Systems have been the workhorses of large-scale data analytics for decades. Their key enabling feature was arguably the declarative query language, which brought physical schema independence and automatic optimization of queries. However, their fixed data model and closed set of possible operations have rendered them unsuitable for many advanced analytical tasks. This observation made way for a new breed of systems with generic abstractions for data-parallel programming, among which the arguably most famous one is MapReduce. While bringing large-scale analytics to new applications, these systems still lack the ability to express complex data mining and machine learning algorithms efficiently, or they specialize in very specific domains and give up applicability to a wide range of other problems. Compared to relational databases, MapReduce and the other parallel programming systems sacrifice the declarative query abstraction and require programmers to implement low-level imperative programs and to manually optimize them. This thesis discusses techniques that realize several of the key aspects enabling the success of relational databases in the new context of data-parallel programming systems.
    The techniques are instrumental in building a system for generic and expressive, yet concise, fluent, and declarative analytical programs. Specifically, we present three new methods: First, we provide a programming model that is generic and can deal with complex data models, but retains many declarative aspects of the relational algebra. Programs written against this abstraction can be automatically optimized with techniques similar to those used for relational queries. Second, we present an abstraction for iterative data-parallel algorithms. It supports incremental (delta-based) computations and transparently handles state. We give techniques to make the optimizer iteration-aware and deal with aspects such as loop-invariant data. The optimizer can produce execution plans that correspond to well-known hand-optimized versions of such programs. That way, the abstraction subsumes dedicated systems (such as Pregel) and offers competitive performance. Third, we present and discuss techniques to embed the programming abstraction into a functional language. The integration allows for the concise definition of programs and supports the creation of reusable components for libraries or domain-specific languages. We describe how to integrate the compilation and optimization of the data-parallel programs into a functional language compiler to maximize optimization potential.
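
The first of the three methods rests on a classical idea worth making concrete: a declarative program fixes *what* to compute, leaving the optimizer free to pick among equivalent execution plans. The toy example below is my own construction, not the thesis's actual optimizer; it shows the canonical filter-pushdown rewrite that such an optimizer can apply automatically because both plans are provably equivalent.

```python
# Toy illustration (illustrative names, not the thesis's system) of why a
# declarative specification enables automatic optimization: two equivalent
# plans for "join orders with users, keep orders over 100".

orders = [(1, "alice", 700), (2, "bob", 80), (3, "alice", 120)]
users = [("alice", "DE"), ("bob", "US")]

def join(left, right, key_l, key_r):
    # Naive nested-loop equi-join, enough for the illustration.
    return [l + r for l in left for r in right if l[key_l] == r[key_r]]

def naive_plan():
    # Join first, filter afterwards: the join touches every order.
    joined = join(orders, users, 1, 0)
    return [t for t in joined if t[2] > 100]

def optimized_plan():
    # Filter pushed below the join: the join sees only qualifying orders.
    big_orders = [o for o in orders if o[2] > 100]
    return join(big_orders, users, 1, 0)

assert sorted(naive_plan()) == sorted(optimized_plan())  # same logical result
print(sorted(optimized_plan()))
# [(1, 'alice', 700, 'alice', 'DE'), (3, 'alice', 120, 'alice', 'DE')]
```

The thesis's contribution, per the abstract, is making rewrites of this kind applicable beyond relational queries, to generic data-parallel programs over complex data models.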

    Programmierabstraktionen, Übersetzung und AusfĂŒhrungstechniken fĂŒr massiv parallele Datenanalyse

    No full text
    Owing to falling storage prices, we are currently observing an explosive growth in the amount of available data. This development gives businesses and scientific institutions the opportunity to analyze empirical data at unprecedented scale. For many companies, the analysis of the data collected in their operational business has long become a central strategic aspect. In contrast to traditional business intelligence, these analyses no longer consist only of traditional relational queries; an increasing share are complex algorithms from data mining and machine learning, used to detect hidden patterns in the data or to train predictive models. With growing data volume and analysis complexity, however, a new generation of systems is needed that can cope with this combination of query complexity and data volume. Relational databases have long been the workhorse of large-scale data analysis, largely owing to their declarative query languages, which made it possible to separate the logical and physical aspects of data storage and processing, and to optimize queries automatically. Their rigid data model and limited set of possible operations, however, severely restrict the applicability of relational databases to many of the newer analytical problems. This insight initiated the development of a new generation of systems and architectures characterized by very generic abstractions for parallelizable analytical programs; MapReduce is without doubt the most prominent representative of these systems.
    While this new generation of systems simplified data analysis and opened it up to diverse new application areas, it is not able to express complex applications from data mining and machine learning efficiently without specializing heavily in specific applications. Compared to relational databases, MapReduce and comparable systems have moreover given up the declarative abstraction and force users to write low-level programs and optimize them manually. This dissertation presents techniques that make it possible to realize several of the central properties of relational databases in the context of this new generation of data-parallel analysis systems. With these techniques it is possible to describe an analysis system whose programs are simultaneously generic and expressive, yet concise and declarative. Specifically, we present the following techniques: First, a programming abstraction that is generic and can handle complex data models, while at the same time retaining many of the declarative properties of relational algebra. Programs developed against this abstraction can be optimized similarly to relational queries. Second, we present an abstraction for iterative data-parallel algorithms. The abstraction supports incremental (delta-based) computations and handles stateful computations transparently. We describe how a relational query optimizer can be extended so that it optimizes iterative queries effectively, and show that this enables the optimizer to automatically generate execution plans that correspond to well-known, manually written programs. The abstraction thereby subsumes specialized systems (such as Pregel) and offers comparable performance.
    Third, we present methods for embedding the programming abstraction into a functional language. This integration makes it possible to write concise programs and to easily create reusable components, libraries, and domain-specific languages. We describe how the compilation and optimization of the data-parallel programs is integrated with the compiler of the functional language so that maximal optimization potential is preserved.

    POP/FED: Progressive Query Optimization for Federated Queries in DB2

    No full text
    Federated queries are regular relational queries accessing data on one or more remote relational or non-relational data sources, possibly combining them with tables stored in the federated DBMS server. Their execution is typically divided between the federated server and the remote data sources. Outdated and incomplete statistics have a bigger impact on a federated DBMS than on a regular DBMS, as the maintenance of federated statistics is considerably more complicated and expensive than the maintenance of local statistics; consequently, federated queries often perform badly because a suboptimal query plan is selected. To solve this problem we propose a progressive optimization technique for federated queries, called POP/FED, extending the state of the art in progressive reoptimization for local queries, POP [4]. POP/FED uses (a) an opportunistic but risk-controlled reoptimization technique for federated DBMS, (b) a technique for multiple reoptimizations during federated query processing, with a strategy to discover and eliminate redundant partial results, and (c) a mechanism to eagerly procure statistics in a federated environment. In this demonstration we showcase POP/FED, implemented in a prototype version of WebSphere Information Integrator for DB2, using the TPC-H benchmark database and its workload. For selected queries of the workload we show unique features, including multi-round reoptimizations, using both a new graphical reoptimization progress monitor, POPMonitor, and the DB2 graphical plan explain tool.
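
The core mechanism of progressive optimization can be sketched abstractly. In the POP approach cited as [4], checkpoint operators compare actual cardinalities against the optimizer's estimates at runtime and trigger reoptimization when the estimate leaves a validity range; everything below is an illustrative model of that idea, with hypothetical names, not the DB2 implementation.

```python
# Rough sketch of a progressive-optimization checkpoint: monitor actual
# cardinality during execution and request reoptimization when it leaves
# the validity range of the optimizer's estimate. Names are hypothetical.

def run_with_checkpoint(rows, est_cardinality, validity_factor=10):
    """Scan `rows`; if the actual cardinality exceeds the validity range
    around the estimate, stop and request reoptimization rather than
    finishing with a plan chosen for the wrong cardinality."""
    seen = []
    upper = est_cardinality * validity_factor
    for row in rows:
        seen.append(row)
        if len(seen) > upper:
            # The partial result (`seen`) could be reused by the new plan,
            # in the spirit of POP/FED's redundant-work elimination.
            return "REOPTIMIZE", seen
    return "DONE", seen

# Statistics estimated ~3 remote rows; the federated source returns 100.
status, partial = run_with_checkpoint(range(100), est_cardinality=3)
print(status)        # REOPTIMIZE
print(len(partial))  # 31: rows already fetched, available for reuse
```

The federated setting sharpens the motivation: since remote-source statistics are costly to maintain, mis-estimates like the one simulated here are the common case rather than the exception.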